Audiovisual Tracking of Multiple Speakers in Smart Spaces

This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms a Viola & Jones face detector classifier kernel into a likelihood estimator, leveraging knowledge from multiple classifiers trained for different face poses. Additionally, we propose a mechanism to handle occlusions based on the new likelihood’s dispersion. The audio localization proposal utilizes a probabilistic steered response power, representing cross-correlation functions as Gaussian mixture models. Moreover, to prevent tracker interference, we introduce a novel mechanism for associating Gaussians with speakers. The evaluation is carried out using the AV16.3 and CAV3D databases for Single- and Multiple-Object Tracking tasks (SOT and MOT, respectively). GAVT significantly improves the localization performance over audio-only and video-only modalities, with up to 50.3% average relative improvement in 3D when compared with the video-only modality. When compared to the state of the art, our audiovisual system achieves up to 69.7% average relative improvement for the SOT and MOT tasks in the AV16.3 dataset (2D comparison), and up to 18.1% average relative improvement in the MOT task for the CAV3D dataset (3D comparison).


Introduction
Smart spaces are environments equipped with a set of monitoring sensors and with communication and computing systems. The primary objective of smart spaces is to understand people's behavior and interactions within them, in order to improve human-machine interfaces. In this context, the core low-level information required includes the presence, position, orientation (pose), and voice activity of users within the space, as these features play a significant role in the high-level understanding of behavior and of the interactions between the users and the environment.
Many of the approaches proposed in the literature for people localization within a monitored scene use information from a single type of sensor (video cameras [1][2][3], microphone arrays [4][5][6], infrared beacons [7][8][9], and others). Among them, the most used ones are video cameras and microphone arrays. In applications such as "smart" video conferencing, human-machine interfacing [10], automatic scene analysis [11], automatic camera tracking [12], and far-field speech recognition [13] it would be possible to locate and monitor speakers or audio sources within the space using the audio or video information. Additionally, providing accurate speaker positions could facilitate other tasks such as speech recognition [14].
Smart spaces are usually closed environments where reverberation is present, which complicates audio-based localization [15]. Microphones are usually located in fixed positions on walls or ceilings. Therefore, the distance between the speech sources and the available microphones lowers the signal-to-noise ratio, making it difficult to locate the source [16]. Video cameras are usually included in this kind of application to increase tracking accuracy, as they provide additional information.
Nevertheless, video camera-only-based tracking systems also have their shortcomings. The uncertainties inherent to the acquisition system, the lighting conditions (brightness, shadows, contrast), and noise (sensor or optics characteristics) [17] decrease their accuracy in people pose extraction in real scenarios. Another problem in the vision-based systems is the total or partial occlusion of the subjects to be tracked [18] due to the light's directional nature. In this context, audio signals are not strictly directional, especially at low frequencies. Hence, they are the perfect complement to the visual tracking process when targets are occluded or out of the camera's field of view (FoV).
From the discussion above, it is clear that audio and video sources provide complementary information for people tracking in smart spaces. A combination of the best features from both sensors can thus improve the accuracy and robustness of the tracking process. This combined use of audio and video information in the tracking task is referred to as audiovisual tracking, in which the mouth is the element to track, as this is the area of maximum radiated acoustic power when speaking.

Previous Works
Given the increased availability of easy-to-deploy audio and video sensors and the improvements in computing facilities, recent years have seen a notable growth in the number of proposals for multiple-speaker tracking (Multiple-Object Tracking, MOT) in smart spaces that combine audio and video information [17,19-32]. Qian et al. [17] conducted an extensive review of the state of the art in audiovisual speaker tracking. In their review, the proposals in the literature were classified according to aspects such as the space used to perform the tracking (image plane, ground level, or three-dimensional (3D) space), the configuration of the sensor location (co-located or distributed), the number of sensors, and the tracking method, among others. This review highlighted the main research lines and open problems in the audiovisual speaker tracking field.
In that review, one of the most relevant research lines focuses on tracking a variable and unknown number of speakers [19,27,33], a problem that is still not fully solved. Another line focuses on finding a better visual appearance model to track multiple speakers in indoor environments [34]. At the same time, other proposals are centered on audiovisual tracking in compact configurations (co-located camera and microphone array) for applications such as human-robot interaction [17,20,21,35].
In any case, a clear tendency in the literature proposals is the predominance of probabilistic tracking models such as Bayesian Filters (BFs, i.e., mainly Kalman filters and particle filters) [17,20,29,31,32,34,36,37]. The reasoning behind the use of BFs is that they provide a natural and robust framework to combine different sources of information, in this case, diverse sensing systems.
The reduction of the number of sensors is another important tendency in the literature. Most recent approaches address the tracking task with only one camera and one microphone array [17,20,27]. Many of these proposed systems track speakers in the image plane (2D) [32,38]. However, proposals addressing localization in the full 3D space can also be found in [17,20,24,27,29,31], which imposes the additional challenges analyzed below.
Finally, in the MOT context, it is always necessary to handle situations, such as occluded or closely located targets, where care must be taken in the association between measurements and targets [20]. These situations can be very challenging when audio or visual observations fail or give low-confidence likelihoods, and they have motivated an interesting set of proposals in the literature, as analyzed in the following sections.
Another general aspect is the recent rise of deep learning techniques in audiovisual speaker tracking. The most commonly used approach is based on an observation model that keeps the Bayesian scheme. There are many examples of face detectors based on deep learning [25-29,31], although Siamese networks have also been used to generate measures of particle similarity to previous reference images of each target [32,39], as have fusion models based on the attention mechanism [32]. We found fewer proposals with end-to-end trained audiovisual solutions, as in [24,40] for object tracking, in which visual and auditory inputs are fused by an added fusion layer.
This paper presents a complete contribution to robust audiovisual multiple-speaker 3D tracking in smart spaces. The proposed approach is thoroughly analyzed and compared using the best available datasets published for this purpose. We also compare our method against, to the best of our knowledge, the best proposals in the state-of-the-art, emphasizing our focus on signal recognition and pattern analysis. This distinguishes our work from recent proposals that primarily rely on deep learning techniques.

Visual Tracking
An important issue to analyze within the visual part of the tracking is how to locate the mouth position in the image plane. Within a smart space, speakers can be far from the sensors, at distances of up to several meters, where detecting the mouth is challenging, as its size can be just a few pixels in the image. Given its small size, the face is usually located first, as it is easier to detect, and the mouth location within the face area is then estimated. This strategy is used in several proposals [17] and allows facial features to be estimated with pattern recognition methods.
Typically, audiovisual speaker tracking literature uses color histogram-based likelihood estimations to locate faces in images [19,26,41]. However, this technique has proved to be not very accurate [17,32]. More accurate techniques for face location within the image are based on the use of trained detectors like the classic Viola and Jones (VJ) [42], or more recent ones based on deep learning [17].
Regarding the estimation of the mouth position within the face area, most face detectors used in audiovisual tracking are independent of the face pose, as in [17,20]. This approach has limited accuracy for mouth position estimation, as it depends on the face pose relative to the camera. To partially overcome this issue, in [17,20] the mouth localization is solved considering the aspect ratio of the face detection bounding box (BB).
Location in a 3D space using a single camera poses an additional challenge in the visual tracking part, as it requires estimating the target depth based on the scene projection on the 2D plane. From the 2D mouth position estimation, Qian et al. in [20] proposed a 3D projection assuming the shoulders width in 3D as a known parameter. In [17], as faces are detected in 2D with the face detection MXNet [43] network, the 3D mouth position is derived assuming that the face detection BB is related to the distance between the target and the camera.
A third issue with the mouth localization is its integration into the probabilistic tracking framework. In some works, like [17], when a face is located in 2D, a BB is projected to the 3D space, assuming a predefined probability distribution around its 3D points. As a more realistic proposal, in [44] a likelihood function for the face location is proposed, built with the internal information extracted from a modified VJ detector.
Most proposals [19,20,33] were evaluated in the AV16.3 database [45] where people's faces were always visible on some of the available cameras (people do not turn their backs to all the cameras simultaneously). In [17], the more complex evaluation dataset CAV3D was presented, using a co-located audiovisual sensor setup, where faces are not always visible (sometimes people turn their backs to the camera), including strange poses and varied dynamic behaviors. This new condition implies that a face detector will not provide localization information as long as the camera does not see the faces, and thus another mechanism is required to allow the localization task to provide accurate estimations, generating another relevant challenge to be solved in the visual tracking task.
Some solutions implement a generative color histogram-based algorithm when the faces are not detected [17]. However, when complex contexts arise, with part of the target's movement happening outside of the camera's FoV (such as in CAV3D), the audio modality is the only one available, and no color-based approach can be used.

Audio Tracking
In audio localization, the most common approach is to compute the Generalized Cross-Correlation function (GCC) between pairs of microphones to generate an acoustic activation map with Steered Response Power (SRP) strategies, usually combined with the PHAT transform [20,46-54]. The audio likelihood model thus obtained is then associated with the SRP acoustic map. The spatial resolution of these methods strongly depends on the array geometry, and for small microphone arrays (short distance between microphones, compared to the search space area), SRP presents a wide active response (low resolution), mainly in radial distance from the source [17].
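To make the GCC-PHAT computation between one microphone pair concrete, the following is a minimal sketch using only the Python standard library. It is not the implementation used in any of the cited works: the function names, the naive O(N^2) DFT, the circular (rather than zero-padded) correlation, and the synthetic test signal are all assumptions made for illustration.

```python
import cmath
import random


def dft(x):
    """Naive discrete Fourier transform (fine for short illustrative signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def gcc_phat(sig_a, sig_b, eps=1e-12):
    """Circular GCC-PHAT of two equal-length signals.

    Index t of the result is the lag (in samples) by which sig_a trails sig_b.
    """
    spec_a, spec_b = dft(sig_a), dft(sig_b)
    cross = [a * b.conjugate() for a, b in zip(spec_a, spec_b)]
    # PHAT weighting: discard the magnitude and keep only the phase.
    whitened = [c / (abs(c) + eps) for c in cross]
    return [value.real for value in idft(whitened)]


# Hypothetical example: the second channel is the first one delayed by 3 samples.
random.seed(0)
n, true_lag = 32, 3
reference = [random.gauss(0.0, 1.0) for _ in range(n)]
delayed = [reference[(t - true_lag) % n] for t in range(n)]

gcc = gcc_phat(delayed, reference)
estimated_lag = max(range(n), key=lambda t: gcc[t])
```

In an SRP scheme, each candidate 3D position is mapped to the lag it would produce in every microphone pair, and the GCC values at those lags are accumulated to build the acoustic map.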
Another problem in audio localization is room reverberation, which generates multiple peaks in GCC. To alleviate this effect, the approach in [55] proposed to model GCC as a Gaussian mixture. This strategy allows associating only one GCC Gaussian (or peak) with the source.
For multiple-speaker tracking (MOT), the association between GCC peaks and speakers must be addressed, taking also into account that when more than one speaker is in the same direction relative to the microphone array position, the GCC peaks related to the different speakers are mixed. To address this problem, ref. [55] proposes to use an a priori distribution from a Kalman filter to assign each peak to one of the speakers in each pair of microphones, where the assignment is made independently for each pair of microphones.
In [29], the authors propose a phase-aware VoiceFilter and a separation before the localization method. They separate the speech from different speakers by first using VoiceFilter with phase estimation and then applying a localization algorithm. However, the method needs clean speech samples for each speaker in the training phase, thus limiting its applicability. In [32], a novel acoustic map based on a spatial-temporal Global Coherence Field (stGCF) map is proposed, which utilizes a camera model to establish a mapping relationship between the audio and video localization spaces.

Contributions
The main contributions of our proposal (named GAVT) are: (i) providing an audiovisual localization approach in which the visual strategy is based on exploiting the knowledge of a trained face detector, modified to generate a likelihood model. This approach will thus not depend on a standard face detection process; (ii) using a pose-dependent strategy that improves the 2D mouth location estimation, thanks to the likelihood model extension with a new in-plane rotation and an image evaluation exploration; (iii) proposing a mechanism to handle MOT tasks for visual observations dependent on the new likelihood dispersion; and finally (iv) presenting a novel mechanism to avoid target interference in audio tracking that selects the most suitable GCC peak for each target, based on the joint distribution of all pairs of microphones.
The evaluation will be done on the AV16.3 and CAV3D datasets, both in single and multiple-speaker scenarios, to provide realistic and comparative quantitative and qualitative results and contributions within all the challenges exposed in this state-of-the-art review.
The remainder of this paper is structured as follows: Section 3 describes the notation used and the definition of the multi-speaker tracking problem. Section 4 describes the general scheme of our proposal, while Sections 5 and 6 detail the proposed video and audio observation models, respectively. The experimental setup and results obtained are described correspondingly in Sections 7 and 8. Finally, Section 9 presents the paper conclusions.

Notation and Problem Statement
Real scalar values are represented by lowercase letters (e.g., $\alpha$, $c$). Vectors are represented by lowercase bold letters (e.g., $\mathbf{x}$). Matrices are represented by uppercase bold letters (e.g., $\mathbf{M}$). Uppercase letters are reserved to define vector and set sizes (e.g., vector $\mathbf{y} = (y_1, \ldots, y_M)^T$ is of size $M$). Calligraphic fonts are reserved to represent sets (e.g., $\mathcal{R}$ for real or $\mathcal{G}$ for generic sets). The $l_p$ norm ($p > 0$) of a vector is depicted as $\|\cdot\|_p$, e.g., $\|\mathbf{x}\|_p = (|x_1|^p + \cdots + |x_N|^p)^{1/p}$, where $|\cdot|$ is reserved to represent absolute values of scalars or the modulus of complex values. The $l_2$ norm $\|\cdot\|_2$ (Euclidean distance) will be written by default as $\|\cdot\|$ for simplicity. The discrete Fourier transform of a discrete signal $x[n]$ is represented with the complex function $X[\omega]$, with $X^*[\omega]$ being the complex conjugate of $X[\omega]$. Throughout the paper, we will use the $a$, $v$, and $av$ superscripts to refer to elements belonging to the audio, video, and audiovisual modalities, respectively. Moreover, the tilde in $\tilde{X}$ refers to the projection of $X$ in a different space.
Within the notation context defined above, let us consider an indoor environment with a set of $N_M$ microphones $\mathcal{M} = \{\mathbf{m}_1, \mathbf{m}_2, \ldots, \mathbf{m}_{N_M}\}$, where $\mathbf{m}_\nu = (m_{\nu x}, m_{\nu y}, m_{\nu z})^T$ is a known three-dimensional vector denoting the position of the $\nu$-th microphone from the reference coordinate origin. For processing purposes, the microphones are grouped in pairs, as elements in a set $\mathcal{Q} = \{\pi_1, \pi_2, \ldots, \pi_{N_Q}\}$, where $\pi_j = (\mathbf{m}_{j_1}, \mathbf{m}_{j_2})$ is composed of two three-dimensional vectors $\mathbf{m}_{j_1}$ and $\mathbf{m}_{j_2}$ ($\mathbf{m}_{j_1}, \mathbf{m}_{j_2} \in \mathcal{M}$, with $\mathbf{m}_{j_1} \neq \mathbf{m}_{j_2}$) that describe the spatial location of the microphones in the pair $j$. If all microphone pairs are allowed, $N_Q = N_M (N_M - 1)/2$.
Given this setup, let us assume that there is a set of $N_S$ acoustic sources $\mathcal{S} = \{\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_{N_S}\}$, where $\mathbf{r}_i = (r_{ix}, r_{iy}, r_{iz})^T$ is a known three-dimensional vector, emitting $N_S$ acoustic signals $x_i(t)$, which are received by each microphone $\mathbf{m}_\nu$, obtaining $N_M$ time signals $s_\nu(t)$ according to the propagation model in Equation (1):
$$s_\nu(t) = \sum_{i=1}^{N_S} \left( h_{i,\nu}(t) * x_i(t) + \eta_{i,\nu}(t) \right) \tag{1}$$
with $h_{i,\nu}(t)$ being the Room Impulse Response (RIR) between the acoustic source position $\mathbf{r}_i$ and the $\nu$-th microphone, $*$ the convolution operator, and $\eta_{i,\nu}(t)$ a signal that models all the adverse effects on the audio signal not included in $h_{i,\nu}(t)$ (noise, interference, etc.).
In an anechoic (free-field) condition, the signals received by each microphone are just a delayed and attenuated version of the acoustic source signal, as shown in Equation (2):
$$s_\nu(t) = \sum_{i=1}^{N_S} \left( \alpha_{i,\nu}\, x_i\big(t - \tau_{i,\nu}(\mathbf{r}_i)\big) + \eta_{i,\nu}(t) \right) \tag{2}$$
where $\tau_{i,\nu}(\mathbf{r}_i) = c^{-1} \|\mathbf{r}_i - \mathbf{m}_\nu\|$ is the propagation delay between $\mathbf{m}_\nu$ and $\mathbf{r}_i$, $\alpha_{i,\nu} = \frac{1}{4 \pi c\, \tau_{i,\nu}(\mathbf{r}_i)}$ is a distance-related attenuation assuming spherical propagation [56], and $c$ is the sound propagation velocity in air.
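As an illustration of the free-field quantities just defined, the sketch below computes the propagation delay and the spherical-spreading attenuation for one source and one microphone pair. The geometry, the function names, and the value used for the speed of sound are assumptions chosen for the example, not values taken from the experimental setup.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at room temperature


def propagation_delay(source, mic, c=SPEED_OF_SOUND):
    """Delay tau = ||r - m|| / c between a source and a microphone, in seconds."""
    return math.dist(source, mic) / c


def attenuation(source, mic, c=SPEED_OF_SOUND):
    """Spherical-spreading attenuation alpha = 1 / (4 * pi * c * tau)."""
    return 1.0 / (4.0 * math.pi * c * propagation_delay(source, mic, c))


# Hypothetical geometry (metres): one source and a two-microphone pair.
source = (2.0, 1.0, 1.5)
mics = [(0.0, 0.0, 1.5), (0.2, 0.0, 1.5)]

delays = [propagation_delay(source, m) for m in mics]
gains = [attenuation(source, m) for m in mics]

# Time difference of arrival for the pair: the quantity that the GCC peak
# position encodes in the anechoic case.
tdoa = delays[0] - delays[1]
```

Because $c\,\tau$ equals the source-to-microphone distance, the attenuation reduces to $1/(4\pi d)$, so the farther microphone in the example receives the weaker and later copy of the signal.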
The environment is also equipped with a set of $N_C$ cameras $\mathcal{C} = \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_{N_C}\}$, giving the corresponding images $I_\theta(t)$, where $\mathbf{c}_\theta = (c_{\theta x}, c_{\theta y}, c_{\theta z})^T$ is a known three-dimensional vector denoting the position of the $\theta$-th camera from the reference coordinate origin. The intrinsic calibration matrix $\mathbf{K}_\theta$ for each camera is also available, with $\theta = 1, \ldots, N_C$.
In the geometrical discussions, we will refer to the 3D space as that in which every point is defined by its 3D coordinates $(x, y, z)^T$. The visual observation space will be defined in the 2D plane (image plane) as $(u, v, s)^T$, where $(u, v)$ are the pixel coordinates in the image plane, and $s$ refers to the size of the explored image plane.
The proposal in this work will be exploiting the formulation of Bayesian Filters (BFs), which are techniques that estimate the posterior probability density function (PDF) of a system state whose dynamics are statistically modeled along time, given a set of observations or measurements [57], also statistically modeled.
In these models, the state ($\mathbf{x}_n$) and observation ($\mathbf{z}_n$) vectors evolve over time, with $n$ being the corresponding time instant. The state vector $\mathbf{x}_n$ characterizes the properties of the system to be estimated (e.g., position, velocity, physical dimensions), and the observation vector $\mathbf{z}_n$ gathers the measurements of the system behavior from all sensors in the environment.
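In the BF framework, the estimation is made in two steps: prediction and update (also referred to as correction [57]). In the prediction stage, a prior PDF of the state vector is computed given its previous value $p(\mathbf{x}_{n-1}|\mathcal{Z}_{n-1})$ and using the state model $p(\mathbf{x}_n|\mathbf{x}_{n-1})$, as shown in Equation (3). In the update stage, a new posterior PDF $p(\mathbf{x}_n|\mathcal{Z}_n)$ is computed (see Equation (4)) from the prior obtained in the prediction stage, by including the information in the current observation vector $\mathbf{z}_n$ through the observation model $p(\mathbf{z}_n|\mathbf{x}_n)$ and its likelihood $p(\mathbf{z}_n|\mathcal{Z}_{n-1})$:
$$p(\mathbf{x}_n|\mathcal{Z}_{n-1}) = \int p(\mathbf{x}_n|\mathbf{x}_{n-1})\, p(\mathbf{x}_{n-1}|\mathcal{Z}_{n-1})\, d\mathbf{x}_{n-1} \tag{3}$$
$$p(\mathbf{x}_n|\mathcal{Z}_n) = \frac{p(\mathbf{z}_n|\mathbf{x}_n)\, p(\mathbf{x}_n|\mathcal{Z}_{n-1})}{p(\mathbf{z}_n|\mathcal{Z}_{n-1})} \tag{4}$$
where $\mathcal{Z}_n = \{\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_n\}$. Particle filters (PFs) are a particular class of BFs that approximate the state distribution with a set of weighted samples $\{\mathbf{x}^i_n, w^i_n \mid i = 1, \ldots, N_P\}$, called particles, that characterize each estimation hypothesis, as shown in Equation (5):
$$p(\mathbf{x}_n|\mathcal{Z}_n) \approx \sum_{i=1}^{N_P} w^i_n\, \delta(\mathbf{x}_n - \mathbf{x}^i_n) \tag{5}$$
where $N_P$ is the number of particles, and $w^i_n$ are the weights characterizing the probability of every given particle $\mathbf{x}^i_n$ or state value hypothesis.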
The update stage is then carried out by applying Importance Sampling [57] (IS, and its derivation Sequential Importance Resampling, SIR), which is a statistical technique to estimate the properties of a posterior distribution $p(\mathbf{x}_n|\mathcal{Z}_n)$ when its samples are generated from a different distribution, in this case the prior PDF $p(\mathbf{x}_n|\mathcal{Z}_{n-1})$.
In this work, PFs are used for tracking the mouth position of multiple speakers in a smart space using audio and video information. The observation vector $\mathbf{z}_n$ is therefore composed of observations from the audio modality $\mathbf{z}^a_n$ and from the video modality $\mathbf{z}^v_n$, so that $\mathbf{z}_n = (\mathbf{z}^a_n, \mathbf{z}^v_n)^T$.

Audiovisual Tracking: The GAVT System Proposal
The tracking task is carried out using only one video camera and one small microphone array, in both co-located and distributed scenarios. It is assumed that the sensors are calibrated and that the audio and video signals are synchronized. It is also assumed that there is a constant (and known) number of speakers (targets) in the space and that speakers do not stop talking for long periods. The problem complexity is thus concentrated in the multimodal 3D localization and tracking task in a multiple-speaker context, within a probabilistic approach without a reidentification process to solve the possible association problems that may appear in such a context. Within this framework, and following the notation and the problem statement previously described, this section presents the global system proposed for addressing the audiovisual tracking task.

General Architecture
In our proposal, each target speaker or objective $o$ will be characterized at any time instant $n$ by a state vector $\mathbf{x}_{n,o} = (\mathbf{p}_{n,o}, \dot{\mathbf{p}}_{n,o})^T$, where $\mathbf{p}_{n,o} = (x_{n,o}, y_{n,o}, z_{n,o})^T$ is the speaker 3D position, and $\dot{\mathbf{p}}_{n,o} = (\dot{x}_{n,o}, \dot{y}_{n,o}, \dot{z}_{n,o})^T$ its velocity.
A separate particle filter is used for tracking each target $o$, with the SIR algorithm, using a set of $N_P$ particles ($N_P$ is assumed to be fixed for all targets and time instants). Each particle $i$ ($i = 1, \ldots, N_P$) at any time instant $n$ will be characterized by a state vector $\mathbf{x}^i_{n,o} = (\mathbf{p}^i_{n,o}, \dot{\mathbf{p}}^i_{n,o})^T$. Then, during the update stage at time $n$, audio and video data $\mathbf{z}^{av}_n = (\mathbf{z}^a_n, \mathbf{z}^v_n)^T$ are used to evaluate, for every target $o$ and predicted particle $i$, the audiovisual likelihood $l^{av}(\mathbf{p}^i_{n,o})$, combining the likelihoods calculated from the audio and video modalities, $l^a(\mathbf{p}^i_{n,o})$ and $l^v(\mathbf{p}^i_{n,o})$. Likelihood calculations are carried out for each predicted particle using the corresponding target observation model $p(\mathbf{z}_{n,o}|\mathbf{x}_{n,o})$.
For every target $o$, the particles' weights $\{w^1_{n,o}, w^2_{n,o}, \ldots, w^{N_P}_{n,o}\}$ are obtained, forming the sampled version of the likelihood $p(\mathbf{z}_{n,o}|\mathcal{Z}_{n-1,o})$ at the correction step of the standard PF framework in Equation (4).
After the update process, the particle set is resampled with the multinomial resampling method proposed in [58]. This way, particles with low weights are eliminated and those with high weights are replicated, keeping constant the number of particles $N_P$ used to characterize the state hypotheses for the position of speaker $o$ at that time instant.
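The resampling step can be sketched as follows. This is a generic multinomial resampler written with the Python standard library, not the code of [58]; the function name, the optional `rng` argument, and the choice of returning uniform weights after resampling are assumptions made for illustration.

```python
import bisect
import itertools
import random


def multinomial_resample(particles, weights, rng=random):
    """Draw len(particles) samples with replacement, proportionally to weights.

    Returns the resampled particles and a set of uniform weights.
    """
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    resampled = []
    for _ in particles:
        u = rng.random() * total                   # uniform draw in [0, total)
        idx = bisect.bisect_right(cumulative, u)   # first bin whose edge exceeds u
        resampled.append(particles[min(idx, len(particles) - 1)])
    uniform = [1.0 / len(particles)] * len(particles)
    return resampled, uniform


# Hypothetical example: only the second particle has non-zero weight,
# so every resampled particle is a copy of it.
rng = random.Random(0)
particles = [(0.0, 0.0, 1.6), (1.0, 0.5, 1.7), (2.0, 1.0, 1.5)]
weights = [0.0, 1.0, 0.0]
new_particles, new_weights = multinomial_resample(particles, weights, rng)
```

The number of particles is preserved by construction, which matches the fixed $N_P$ assumed for every target.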
As a final result, we obtain the new particle set $\mathcal{P}_{n,o} = \{\mathbf{x}^1_{n,o}, \ldots, \mathbf{x}^{N_P}_{n,o}\}$, which constitutes the sampled version of the posterior PDF $p(\mathbf{x}_{n,o}|\mathcal{Z}_{n,o})$. This new particle set will be the one used as the prior distribution in the next time step.
The general scheme described above is graphically presented in Figure 1.

Prediction
The state model used at each iteration of the PF to propagate the particles is the Langevin motion model [59], commonly used in the acoustic speaker tracking literature [60], as shown in Equation (6):
$$\mathbf{x}_{n,o} = \mathbf{A}\,\mathbf{x}_{n-1,o} + \mathbf{Q}\,\mathbf{u} \tag{6}$$
where $\mathbf{A}$ and $\mathbf{Q}$ are the transition and noise state matrices, respectively, and $\mathbf{u} \sim \mathcal{N}(\mathbf{0}, \mathbf{\Sigma})$ is the noise related to the process or state, for which a normal distribution is assumed, with zero mean $\mathbf{0}$ and covariance matrix $\mathbf{\Sigma}$.
Matrix $\mathbf{A}$ corresponds to a first-order motion behavior and is described in Equation (7), while the coefficients of the $\mathbf{Q}$ matrix are given by Equation (8). In these equations, $\Delta T$ is the time interval (in seconds) between frames $n$ and $n-1$, $e^{-\beta \Delta T}$ and $\zeta = \bar{v}\sqrt{1 - e^{-2\beta \Delta T}}$ are the process noise model parameters, and $\mathrm{diag}(\cdot)$ is a diagonal matrix with the diagonal values being its arguments. The control parameters used in the proposal are a steady-state velocity $\bar{v}$ and its rate of change $\beta$, following the formulation in [59].
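The prediction step can be illustrated with the form of the Langevin model that is common in the acoustic tracking literature, where the velocity decays by $e^{-\beta \Delta T}$, receives Gaussian excitation scaled by $\bar{v}\sqrt{1 - e^{-2\beta \Delta T}}$, and the position is advanced with the new velocity. The sketch below is written under that assumption and may differ in detail from the matrices in Equations (7) and (8); the names `a` and `b`, the unit-variance noise, and the parameter values are illustrative.

```python
import math
import random


def langevin_predict(state, dt, beta, v_bar, rng=random):
    """Propagate one particle (x, y, z, vx, vy, vz) by one time step."""
    a = math.exp(-beta * dt)              # velocity decay factor
    b = v_bar * math.sqrt(1.0 - a * a)    # scale of the random excitation
    pos, vel = state[:3], state[3:]
    # Assumption: independent unit-variance Gaussian noise on each axis.
    new_vel = [a * v + b * rng.gauss(0.0, 1.0) for v in vel]
    new_pos = [p + dt * v for p, v in zip(pos, new_vel)]
    return tuple(new_pos) + tuple(new_vel)


# Hypothetical parameters: 25 frames per second, beta in 1/s, v_bar in m/s.
rng = random.Random(1)
particle = (1.0, 2.0, 1.6, 0.3, 0.0, 0.0)
predicted = langevin_predict(particle, dt=0.04, beta=10.0, v_bar=1.0, rng=rng)

# With v_bar = 0 the model is deterministic: velocity simply decays.
coasting = langevin_predict((0.0, 0.0, 0.0, 1.0, 0.0, 0.0),
                            dt=0.1, beta=10.0, v_bar=0.0)
```

A larger $\beta$ makes the velocity forget its previous value faster, while $\bar{v}$ sets the steady-state speed that the random excitation sustains.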

Update and Position Estimation
To fuse the audiovisual information generated from the audio and video sources, we assume independence between these modalities, so that they fulfill Equation (9):
$$p(\mathbf{z}^{av}_n|\mathbf{x}_{n,o}) = p(\mathbf{z}^a_n|\mathbf{x}_{n,o})\, p(\mathbf{z}^v_n|\mathbf{x}_{n,o}) \tag{9}$$
In practice, this means that the audiovisual likelihoods are obtained by computing the product of the likelihoods from both modalities, as shown in Equation (10):
$$l^{av}(\mathbf{p}^i_{n,o}) = l^a(\mathbf{p}^i_{n,o})\, l^v(\mathbf{p}^i_{n,o}) \tag{10}$$
Then, following the sampled representation of PFs presented in Equation (5), weights are updated by their likelihood values, as shown in Equation (11). Finally, the most probable state for each target $o$ ($\mathbf{x}_{n,o}$), that is, the deterministic instantaneous value for $p(\mathbf{x}_{n,o}|\mathcal{Z}_{n,o})$, is estimated by evaluating Equation (12) [58].
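A minimal sketch of this update step is given below. The product of the per-modality likelihoods follows the independence assumption stated above; the normalization of the weights, the uniform fallback when every likelihood is zero, and the use of the weighted mean as the point estimate are assumptions of this example and are not necessarily the exact forms of Equations (11) and (12).

```python
def fuse_and_normalise(audio_lik, video_lik):
    """Multiply per-particle likelihoods and normalise them into weights."""
    joint = [la * lv for la, lv in zip(audio_lik, video_lik)]
    total = sum(joint)
    if total <= 0.0:
        # Assumed fallback: no modality supports any particle.
        return [1.0 / len(joint)] * len(joint)
    return [j / total for j in joint]


def weighted_mean(particles, weights):
    """Point estimate of the state as the weighted mean of the particles."""
    dims = len(particles[0])
    return tuple(sum(w * p[d] for p, w in zip(particles, weights))
                 for d in range(dims))


# Hypothetical likelihood values for two particles of one target.
particles = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
audio_lik = [0.2, 0.8]
video_lik = [0.5, 0.5]

weights = fuse_and_normalise(audio_lik, video_lik)
estimate = weighted_mean(particles, weights)
```

In the example the video modality is uninformative (equal likelihoods), so the fused weights follow the audio likelihoods alone.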

Video Observation Model
The visual likelihood $l^v(\mathbf{p}^i_{n,o})$ of the video observation model in the real 3D coordinate system consists of three blocks, as shown in Figure 2, and explained below:
• The first block is an appearance-based observation model based on the VJ likelihood $l^{VJ}(\mathbf{p}^i_{n,o})$. Using the probabilistic version of the VJ detector, this appearance-based algorithm accurately estimates the face and mouth location, taking into account different face poses, and assigns them a likelihood $l^{VJ}(\mathbf{p}^i_{n,o})$, as no tracking-by-detection is performed.
• The second block is based on a color histogram (color-based likelihood) $l^{col}(\mathbf{p}^i_{n,o})$. The color histogram model is intended to be used when the first block fails to detect the face. These cases may appear with poses not correctly handled by the VJ model, for example, with the head tilted too far forward, or due to a person's rapid movement, which blurs facial features and may cause a poor response from the VJ model.
• The third block is a foreground versus background segmentation (Fg./Bg. Segmentation) that generates a foreground likelihood (Fg. Likelihood) $l^{fg}(\mathbf{p}^i_{n,o})$, which is used to validate the proposals from both previous models and to restrict the dispersion of the video observation hypotheses, and which provides the likelihood $l^{fg}(\mathbf{p}^i_{n,o})$ when the other two components fail.
Thus, the general processing sequence of the video observation model $l^v(\mathbf{p}^i_{n,o})$, shown in Figure 2, is as follows:
• The first step is to apply the coordinate transformation from the World Cartesian Coordinate System (WCS) to the Face Observation Space (FOS). There, the tilde in $\tilde{\mathbf{p}}^i_{n,o}$ refers to the projection of $\mathbf{p}^i_{n,o}$ in this FOS. After projection, the algorithm determines which observation hypotheses are in the camera's FoV. If neither the VJ nor the color models can be applied, the foreground likelihood will be assigned.
• In any case, to prevent a specific person's visual hypothesis from being confused with observations from another target, an occlusion detector is included (in the VJ and Color modules). Thus, a procedure to restrict the likelihood analysis of hypotheses declared as occluded (Occlusions Correction block) is applied.
Following the observation model, a global visual likelihood function or video observation model $l^v(\mathbf{p}^i_{n,o})$ is finally defined as in Equation (13), where a confidence level is assigned according to each model component.
If the measurement found with the appearance model It is then interesting to obtain further information about that face hypothesis that may give robustness to the validation process mentioned above. This information is the 2D BB size of the face hypothesis ($S^i_{n,o}$). To determine it, the face height of a person $h^{3D}_r$ is assumed to be constant and known in the 3D WCS. Thus, it is projected to the FOS through the distance $d^i_{n,o}$, as shown in Equation (14):
$$S^i_{n,o} = \frac{f_c\, h^{3D}_r}{d^i_{n,o}} \tag{14}$$
where $f_c$ is the camera focal length. Once the speaker position hypotheses $\{\mathbf{p}^i_{n,o}\}$ are projected from the WCS to the FOS, each particle represents the speaker's mouth 2D location as $\tilde{\mathbf{p}}^i_{n,o} = (u^i_{n,o}, v^i_{n,o}, S^i_{n,o})^T$. Figure 3 shows a schematic view of the projection mechanism described here.
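The projection of a particle into the FOS can be sketched with a pinhole camera model. The example assumes that the particle is already expressed in camera coordinates (the extrinsic transformation from the WCS is omitted), that the focal length is given in pixels, and that the distance used to scale the face height is the Euclidean distance to the camera center; the function name and all numeric values are hypothetical.

```python
import math


def project_to_fos(point_cam, focal_px, principal_point, face_height_m):
    """Map a 3D point in camera coordinates to (u, v, s) in the FOS.

    Returns None when the point lies behind the camera.
    """
    x, y, z = point_cam
    if z <= 0.0:
        return None
    u = focal_px * x / z + principal_point[0]
    v = focal_px * y / z + principal_point[1]
    distance = math.sqrt(x * x + y * y + z * z)
    # Face BB size: assumed-constant 3D face height scaled by focal / distance.
    size = focal_px * face_height_m / distance
    return (u, v, size)


# Hypothetical camera: 800 px focal length, principal point at (320, 240),
# and an assumed face height of 0.25 m.
on_axis = project_to_fos((0.0, 0.0, 2.0), 800.0, (320.0, 240.0), 0.25)
behind = project_to_fos((0.0, 0.0, -1.0), 800.0, (320.0, 240.0), 0.25)
```

Because the size component shrinks inversely with distance, it carries the depth information that a single camera would otherwise lack.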

Appearance-Based Multi-Pose Video Observation Model
For the sake of clarity, the core of the video observation model based on the VJ likelihood, and its characteristics will be explained first. Then the rest of the processes will be detailed.

Viola and Jones Likelihood Model
The VJ likelihood evaluation is made using the probabilistic VJ model described in [44]. This model consists of modifying the standard VJ face detector to obtain face likelihood values. Given an image position $\tilde{\mathbf{p}}^i_{n,o}$, the model applies a cascade of face-trained classifiers returning a likelihood value $\Omega$ as in Equation (15), where $\kappa$ is the number of stages that the image patch passes through the cascade of classifiers, $M$ is the total number of stages, $H_m$ is the weight output by stage $m$, and $\theta_m$ is its threshold. The likelihood model is applied with three different templates to evaluate different possible face poses in yaw rotations. One template handles frontal faces, and the other two handle left and right profile faces, respectively. Therefore, within the 2D BB, the mouth position where the proposed visual likelihood model is applied is different for each template, as shown in Figure 4. The approach is flexible enough to allow for other poses to be considered, for example, by extending the previous templates with in-plane rotations (roll). In this case, the image can be rotated by a given angle ($\alpha$) in the opposite direction to allow reusing the already trained poses.
In this work, the image and the particles are rotated in both directions (clockwise and counterclockwise). For each direction, the models for the three face pose classifiers (front, right, and left profiles) are applied.
Thus, for each position $\tilde{\mathbf{p}}^i_{n,o}$ in the image, nine response values are obtained, as shown in Equation (16), where F, R, and L refer to frontal, right, and left profiles, respectively, and their subindexes refer to the rotation angle. Figure 5 shows the templates for in-plane face rotations: three on the left with clockwise rotation, and three on the right with counterclockwise ones, using $\alpha = 15^\circ$. The characteristics of the model response are explored in the FOS. Figure 6 shows the responses of the nine templates of the VJ likelihood model for three different face poses. Each template responds best to faces whose pose is close to its own profile and rotation. The likelihood responses have higher values around the mouth position for each pose, and more than one template can generate a positive response to the face when the pose is similar. Figure 7 shows that the model response to a face is shaped like a blob in the FOS, resembling a Gaussian. The width of the significant response levels in the $(u, v)$ plane is small, a few pixels in diameter. The width of the response along the template size $S$ is much larger, several tens of pixels.

Exploration Mechanism of the FOS
One exploration alternative could be to evaluate the likelihood at the projection of each particle on the FOS. However, the response peak on the image plane (u, v) is very narrow. This characteristic is desirable for location accuracy, but it increases the probability that no particle falls in the region where a peak appears in the FOS. On the other hand, exhaustively exploring the FOS over the whole volume occupied by the particles may imply a high computational cost.
To ensure that the region with a peak response is found while keeping computational complexity low, we take advantage of the redundancy of the model's face response along the S dimension. Our proposal consists of exploring the volume occupied by the particles in the FOS using three slices in the S dimension. With the response in these slices, we estimate a Gaussian shape of the likelihood in the FOS and then weight all particles with the estimated Gaussian. The three slices, or scanning planes, are placed at the average face size, at a size γ times larger, and at a size γ times smaller than the average. Within each slice, the exploration is restricted to the rectangular area defined by the maximum and minimum particle position values in each dimension, [u_min : u_max, v_min : v_max]. Figure 8 shows the process used to weight the particles.
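As an illustration of the exploration region described above, the following is a minimal sketch and not the authors' implementation. It assumes the particles are already expressed in FOS coordinates (u, v, S) as rows of a NumPy array; the function name `exploration_region` is hypothetical, and γ = 1.2 is the value reported later in the system configuration.

```python
import numpy as np

def exploration_region(particles, gamma=1.2):
    """Return the three S slices and the (u, v) rectangle to be scanned.

    particles: array of shape (N, 3) with FOS coordinates (u, v, S).
    """
    u, v, s = particles[:, 0], particles[:, 1], particles[:, 2]
    s_mean = s.mean()
    # One slice at the average face size, one gamma times smaller,
    # and one gamma times larger.
    slices = np.array([s_mean / gamma, s_mean, s_mean * gamma])
    # Rectangle spanned by the extreme particle positions in the image plane.
    bounds = (u.min(), u.max(), v.min(), v.max())
    return slices, bounds

particles = np.array([[10.0, 20.0, 40.0],
                      [14.0, 26.0, 60.0],
                      [12.0, 22.0, 50.0]])
slices, bounds = exploration_region(particles)
```

Only three planes of the FOS volume are scanned, so the cost grows with the area of the rectangle rather than with the full volume occupied by the particles.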

Gaussian Approximation
For each pose, the VJ likelihood is evaluated in the three slices. Next, a threshold θ_VJ is applied to remove the points that do not belong to the sought Gaussian, and the mean μ_{p_{n,o}} = (μ_u, μ_v, μ_s) and covariance matrix Σ_{p_{n,o}} of the remaining points are computed.
Next, instead of adding all pose outputs, as in [2,23], we select the best pose (pose_best) as the one with the highest response value (described in Equation (17)), which reduced the number of false positives in preliminary experiments.
Using the centroid μ_{p_{n,o}}, the face BB of the corresponding pose is projected onto the Fg./Bg. image. Finally, poses whose BB has less than 60% of intersection with the foreground area are eliminated, as 60% is approximately the area percentage covered by a face in the templates. To prevent a zero variance in the S dimension, which occurs when only one slice has points over the threshold, its minimum value is set to σ^min_ss, so that σ_{ss,o} ≥ σ^min_ss. The right image in Figure 8 shows the particles (red), their mean value (black cross), and the standard deviation of the estimated Gaussian as a 3D surface (green).
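The Gaussian approximation step can be sketched as follows. This is a simplified illustration under stated assumptions: the sampled FOS points and their responses are given as NumPy arrays, the mean and covariance are unweighted, and σ^min_ss is interpreted as a standard deviation, so the variance in S is floored at its square. The function name `fit_gaussian` is hypothetical, and the foreground intersection test is omitted.

```python
import numpy as np

def fit_gaussian(points, responses, theta_vj=0.5, sigma_ss_min=10.0):
    """Fit a Gaussian to the FOS points whose response reaches the threshold.

    points: (K, 3) array of sampled positions (u, v, S).
    responses: (K,) array with the VJ likelihood at each point.
    Returns (mean, covariance), or None if fewer than two points survive.
    """
    keep = responses >= theta_vj
    if keep.sum() < 2:
        return None
    pts = points[keep]
    mu = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False)
    # Assumption: floor the S variance so a single active slice does not
    # produce a degenerate (zero-variance) Gaussian.
    cov[2, 2] = max(cov[2, 2], sigma_ss_min ** 2)
    return mu, cov

points = np.array([[0.0, 0.0, 50.0], [2.0, 0.0, 50.0], [0.0, 2.0, 50.0],
                   [2.0, 2.0, 50.0], [100.0, 100.0, 50.0]])
responses = np.array([0.9, 0.8, 0.7, 0.6, 0.1])
mu, cov = fit_gaussian(points, responses)
```

In this example all surviving points lie in the same slice, so the sample variance in S is zero and the floor takes effect.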

Occlusion Detection with the VJ Model
After estimating which of the measurements are associated with each target, it is necessary to make sure that multiple targets are not related to the same measurement.
The following procedure is used to avoid this situation:
• If any of the targets has a high dispersion, some of its assigned measurements may come from other targets. The threshold above which a target is considered to have a high measurement dispersion is calculated as a fraction of the average face size of its particle set, S̄_o.
• If the dispersion of a target exceeds this threshold and the target is not occluded, the Gaussian related to that target observation is recalculated after removing the assigned measurements within the overlapping region ρ. Then, the Euclidean distance between the centroids of the observations reassigned to the two targets o and o′ is calculated in the image plane as d_uv = d(μ_{p_{n,o}}, μ_{p_{n,o′}}).
• If this distance d_uv is less than two times the dispersion in the image plane (2·σ_{uu,o}), the decision of which measurement is associated with which target relies on a second distance, computed with the predicted particles in the image plane as d_uv = d(μ_{p_{n,o}}, p_{n,o}). The observations are then assigned to the target (o) with the shortest distance d_uv, and the other target (o′) is declared occluded, as its related particles are at a shorter distance than the threshold.
The pseudocode of this process is presented in Algorithm 1. Figure 9 shows an occlusion situation. In the left graph, the image and the related locations (in the image plane) from the particles representing two targets are shown in red and green colors. The center graph shows a top view of the particle positions in the WCS.
Finally, the right-hand graph shows the same particle locations in 2D together with the standard dispersion σ_{uu,o}, represented by a circle (in magenta and cyan). It can be observed that some of the particles representing target two (green) are being assigned in the observation space to target one (red). The resulting increase in particle dispersion reveals the occlusion situation.

VJ Likelihood Assignment
The likelihood values assigned to the particles are given by the Gaussian with mean μ_{p_{n,o}}, covariance matrix Σ_{p_{n,o}}, and amplitude equal to the maximum response observed, Ω_{pose_best} (with pose_best as defined in Equation (17)). Likelihood values below the threshold θ_VJ are set to zero. Equation (18) describes this likelihood.
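A possible reading of this assignment is sketched below. It assumes an unnormalized Gaussian whose peak value equals the amplitude, which is consistent with the description above but is not taken from Equation (18) itself; the function name `vj_likelihood` is hypothetical.

```python
import numpy as np

def vj_likelihood(particles, mu, cov, amplitude, theta_vj=0.5):
    """Weight FOS particles with a Gaussian of given peak amplitude.

    particles: (N, 3) array of (u, v, S); mu: (3,); cov: (3, 3).
    Values below theta_vj are set to zero.
    """
    diff = particles - mu
    # Squared Mahalanobis distance of every particle to the centroid.
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
    lik = amplitude * np.exp(-0.5 * maha)
    lik[lik < theta_vj] = 0.0
    return lik

lik = vj_likelihood(np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]]),
                    mu=np.zeros(3), cov=np.eye(3), amplitude=0.9)
```

The particle at the centroid receives the full amplitude, while the distant particle falls below the threshold and is zeroed.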

Head Color-Based Likelihood Block
As explained above, in the FOS coordinates, each particle is assigned a BB in the image, corresponding to the face in a frontal pose. To tackle face occlusions, the spatiogram of the 2D BB mouth estimation BB^i_{n,o} is calculated as in [17], and replicated in Equation (19). The likelihood is proportional to the Bhattacharyya coefficient between the histogram associated with each particle i and the spatiogram of the reference face patch image BB^{fa}_{n,o}, where the histogram count, the spatial mean, and the spatial covariance matrix in each color bin b are computed for particle i and for the reference face patch image fa, respectively.
The face is considered to be located by the color model (setting C^col_o = 1) if the maximum similarity value exceeds a threshold θ_col. In this case, the likelihood values below the 75th percentile are discarded, to reduce the dispersion of particles caused by the higher dispersion of the color model. Otherwise, we assume that the face is not detected by this color-based strategy (setting C^col_o = 0), as stated in Equation (20). The color reference histogram is initialized in the face BB of the first ground-truth frame, considering a frontal pose. This reference model is updated in every iteration in which the VJ likelihood delivers a confident value. In this case, the face BB corresponding to the best-detected pose is used.
As in the VJ likelihood block, with the color model, particles from an occluded target o that are very close to another target o′ can be captured by the significant likelihood region of o′. To avoid this situation, the particles of o that are very close to o′ are labeled as occluded. The closeness limit is set equal to the diagonal of the average face size of the o target, √2·S̄_o. This procedure is carried out before evaluating the color-based likelihood.
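The similarity and pruning steps of the color block can be illustrated with the simplified sketch below. It is not the spatiogram similarity of [17]: it uses plain color histograms without the spatial mean and covariance terms, and it interprets the pruning rule as zeroing the likelihoods below the 75th percentile. The function names are hypothetical, and the handling of the non-detected case (Equation (20)) is not reproduced.

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya coefficient between two histograms (1.0 = identical)."""
    h1 = h1 / h1.sum()
    h2 = h2 / h2.sum()
    return float(np.sum(np.sqrt(h1 * h2)))

def detect_and_prune(similarities, theta_col=0.6, percentile=75):
    """Return the detection flag C_col and the pruned likelihoods."""
    sims = np.asarray(similarities, dtype=float)
    detected = bool(sims.max() > theta_col)
    if not detected:
        # Assumption: values are returned untouched when no face is found.
        return detected, sims
    cut = np.percentile(sims, percentile)
    return detected, np.where(sims >= cut, sims, 0.0)

same = bhattacharyya(np.array([1.0, 2.0, 1.0]), np.array([2.0, 4.0, 2.0]))
disjoint = bhattacharyya(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
c_col, lik = detect_and_prune([0.1, 0.2, 0.3, 0.9])
```

With four particles, only the one above the 75th percentile keeps its likelihood, which concentrates the particle set around the best color match.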

Correction on Occlusions
The occlusion correction stage is aimed at avoiding hypotheses of one person's position (target) obtaining a high likelihood if mistaken for another target. Simultaneously, this correction mechanism should not penalize hypotheses in an occluded region, thus allowing the mechanism to keep track of persons passing behind one another in the monitored environment.
The correction works globally as described in Algorithm 2. If a target is occluded and all its related position hypotheses are in an occluded area, all their likelihood values l_v(p^i_{n,o}) are set to 1/N_P. Otherwise, if the target has some position hypotheses in an occluded area and others in a non-occluded one, the likelihood of those located in the occluded area (occIdx(i) == 1) is set to the average of those in the non-occluded one (occIdx(i) ≠ 1).
The foreground vs. background segmentation procedure starts by subtracting a reference frame (with the environment background, without people) from the given one, in grayscale. After the subtraction, a threshold θ_fg is applied to the resulting difference image, obtaining a binary image I_{θ,fg}.
Hypotheses in the foreground or outside the camera's FoV receive a uniformly distributed likelihood value (U(p^i_{n,o})). In contrast, those within the FoV but in the background receive a zero weight, as stated in Equation (21).
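The occlusion correction and the foreground test described in this subsection can be sketched as follows. This is an illustrative reading rather than a transcription of Algorithm 2 or Equation (21): the absolute difference in the background subtraction, the `uniform` placeholder for the value of U, and all function names are assumptions.

```python
import numpy as np

def correct_occlusion(lik, occ_idx):
    """Occlusion correction for the hypotheses of one occluded target."""
    lik = np.asarray(lik, dtype=float).copy()
    occ = np.asarray(occ_idx, dtype=bool)
    if occ.all():
        lik[:] = 1.0 / lik.size          # all hypotheses occluded: 1/N_P
    elif occ.any():
        lik[occ] = lik[~occ].mean()      # occluded ones get the visible average
    return lik

def foreground_mask(frame_gray, background_gray, theta_fg):
    """Binary foreground image from background subtraction (assumed abs diff)."""
    diff = np.abs(frame_gray.astype(np.int32) - background_gray.astype(np.int32))
    return diff > theta_fg

def fg_weight(u, v, mask, uniform):
    """Uniform value in the foreground or outside the FoV, zero in the background."""
    height, width = mask.shape
    inside = 0 <= u < width and 0 <= v < height
    if not inside or mask[int(v), int(u)]:
        return uniform
    return 0.0

partial = correct_occlusion([0.2, 0.4, 0.9], [0, 0, 1])
full = correct_occlusion([0.2, 0.4, 0.9, 0.1], [1, 1, 1, 1])
mask = foreground_mask(np.array([[100, 10]], dtype=np.uint8),
                       np.array([[10, 10]], dtype=np.uint8), theta_fg=80)
```

Giving occluded hypotheses the average of the visible ones keeps them alive without letting them dominate, which is the behavior the correction stage aims for.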

Audio Observation Model
In this work, the audio observation model is based on a probabilistic version of SRP-PHAT, proposed in [55]. This model exploits a probabilistic interpretation of the Generalized Cross-Correlation with PHAT transform (GCC-PHAT) between the signals of each pair of microphones. With probabilistic GCC-PHAT, it is possible to associate only one correlation peak to each target. For each time step, the procedure works as follows:
1. The GCC-PHAT for every pair of microphones and its associated Gaussian model are obtained.
2. SRP-PHAT is computed for every particle position.
3. A Gaussian selection procedure chooses which Gaussian is associated with each target.
4. Finally, the probabilistic SRP-PHAT value is associated with the likelihood of the targets.
Figure 10 shows the general scheme of the audio observation model.

GCC-PHAT and Gaussian Model
The GCC-PHAT (to ease the notation, we omit the explicit mention of PHAT when referring to the GCC_PHAT and SRP_PHAT functions in the equations) is computed for the signals arriving at each microphone pair π_j (composed of microphones m_{j1} and m_{j2}) around the evaluated video frame, as presented in Equation (22), where N_f is the number of discrete frequencies used in the Fourier analysis of the discretized signals captured by microphones j_1 and j_2 (sampled versions of the s_{j1}(t) and s_{j2}(t) signals); S_{j1}[k] and S_{j2}[k] are the frequency spectra of these signals; and the weighting term is the PHAT filter. τ is the lag variable of the correlation function, associated with the time difference of arrival (TDoA) of the audio signal at the pair of microphones.
The second step is to model the GCC-PHAT of each microphone pair π_j as a Gaussian Mixture Model (GMM). The main assumption here is that each GCC-PHAT peak is caused by a different acoustic source at a given position, generated by the direct propagation path, by a reverberant echo, or by other noise sources. With this consideration, each peak in the GCC-PHAT function is associated with a Gaussian function in the GMM described by Equation (23), where N_J is the number of peaks detected in the GCC-PHAT function, and μ^h_{π_j}, σ^h_{π_j}, and ω^h_{π_j} represent the mean, standard deviation, and weight of the h-th component of the mixture.
The correlation values are first normalized (making their sum equal to one) and their negative values are set to zero. The GMM parameters {μ^h_{π_j}, σ^h_{π_j}, ω^h_{π_j}}, h = 1, ..., N_J, are estimated according to the procedure described in [55].
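For reference, a standard GCC-PHAT computation for one microphone pair can be sketched as below. This follows the textbook formulation, not necessarily the exact form of Equation (22); the function names, the zero-padding length, and the small constant guarding the division are assumptions. The normalization clips negative values before rescaling so that the result sums to one.

```python
import numpy as np

def gcc_phat(x1, x2, max_lag):
    """GCC-PHAT between two signals for lags in [-max_lag, max_lag] samples.

    A positive peak lag means that x1 is delayed with respect to x2.
    """
    n = len(x1) + len(x2)                       # zero-pad to avoid wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.maximum(np.abs(cross), 1e-12)   # PHAT weighting (unit magnitude)
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc

def normalize_gcc(cc):
    """Clip negative correlation values and scale the rest to sum to one."""
    cc = np.clip(cc, 0.0, None)
    return cc / cc.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
x_delayed = np.concatenate((np.zeros(5), x[:-5]))   # x delayed by 5 samples
lags, cc = gcc_phat(x_delayed, x, max_lag=20)
p = normalize_gcc(cc)
```

The normalized function `p` can then be treated as a discrete density over lags, which is what makes fitting a Gaussian mixture to its peaks meaningful.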

SRP-PHAT and Gaussian Selection
Once the GMM is available, the traditional SRP-PHAT formulation can be applied to the position of each particle. Then, the SRP-PHAT value for a given target o is calculated as the average SRP over all the particles associated with o, \overline{SRP}_{n,o}. Gaussian selection is applied sequentially to every target, starting from the one with the highest \overline{SRP}_{n,o} value. For each target o and microphone pair π_j, the set of particles with maximum SRP is evaluated in each Gaussian h, and the Gaussian with the highest value is selected, as shown in Equation (24).
After the selection, the selected Gaussian is subtracted from the mixture, and the Gaussian selection process continues with all the other targets, in decreasing order of \overline{SRP}_{n,o}. Figure 11 represents the Gaussian selection process in a frame extracted from sequence seq18-2p-0101 in the AV16.3 dataset, where there were two close speakers. The graphic on the left shows in black the GCC_{π_16}(τ) function (for the π_16 microphone pair) and in blue the calculated Gaussian mixture approximation of GCC_{π_16}(τ), along with the projections of the particles with maximum SRP, p^i_{n,o}, for both targets. The graphic on the right highlights the selected Gaussians for each target. In both graphics, the Gaussian selected for target one appears in red, and the one selected for target two appears in green.
Figure 11. Gaussian selection. In the left graphic: GCC_{π_16}(τ) in black, and its Gaussian mixture approximation in blue. In the right graphic: selected Gaussians for target 1 (red) and target 2 (green). The GMM was generated with N_J = 7.
From Figure 11 it can be observed that both targets obtain close TDoA projections for their particles with maximum SRP(p^i_{n,o}) value, sharing the same Gaussian. In this case, target 2 obtained the highest value, so it was assigned the associated Gaussian.
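The sequential selection for a single microphone pair can be sketched as follows. This is a simplified illustration under assumptions: each target is summarized by one representative TDoA (for example, the projection of its maximum-SRP particle), candidates are scored by the weighted Gaussian density at that TDoA, and removal of the chosen component stands in for subtracting it from the mixture. The function names are hypothetical and the exact criterion of Equation (24) is not reproduced.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Value of a univariate Gaussian density at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def select_gaussians(components, target_tdoa, target_srp):
    """Assign one mixture component to each target, best target first.

    components: list of (weight, mean, std) tuples for one microphone pair.
    target_tdoa: dict target -> representative TDoA of that target.
    target_srp: dict target -> average SRP of that target's particles.
    """
    remaining = list(components)
    selected = {}
    # Targets are processed in decreasing order of average SRP.
    for target in sorted(target_srp, key=target_srp.get, reverse=True):
        if not remaining:
            break
        tau = target_tdoa[target]
        best = max(remaining, key=lambda c: c[0] * gaussian_pdf(tau, c[1], c[2]))
        selected[target] = best
        remaining.remove(best)   # a component cannot be reused by another target
    return selected

components = [(0.6, 0.0, 1.0), (0.4, 1.0, 1.0)]
selected = select_gaussians(components,
                            target_tdoa={1: 0.1, 2: 0.2},
                            target_srp={1: 0.5, 2: 0.9})
```

In this toy case both targets project close to the first component, but target 2 has the higher average SRP and is served first, so target 1 falls back to the remaining component.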

Single Gaussian SRP-PHAT Model
The final step to generate a likelihood value from the acoustic information consists of simplifying the SRP-PHAT GMM model, by considering just one Gaussian for each pair of microphones, that with the highest weight value, as shown in Equation (25).
where ω*_{π_j}, μ*_{π_j}, and σ*_{π_j} are the parameters associated with the Gaussian component with the highest weight value in Equation (23).
Because of reverberation and low SNR conditions, some speech segments may exhibit low SRP-PHAT values, degrading the quality of the acoustic power maps. To avoid this degradation, we consider the maximum SRP-PHAT value as an indicator of confidence, so that a threshold θ_a is applied to limit the influence of such segments. Finally, the likelihood from the audio information is calculated as described in Equation (26).
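One way to evaluate such a single-Gaussian model at a particle position is sketched below. It assumes that the SRP value is the sum over microphone pairs of the retained weighted Gaussian evaluated at the TDoA of the position, with TDoAs in seconds and a speed of sound of 343 m/s; these choices and the function names are assumptions and may differ from Equation (25).

```python
import numpy as np

C_SOUND = 343.0  # assumed speed of sound in m/s

def tdoa(p, m1, m2, c=C_SOUND):
    """Time difference of arrival (s) of a source at p for microphones m1, m2."""
    return (np.linalg.norm(p - m1) - np.linalg.norm(p - m2)) / c

def single_gaussian_srp(p, mic_pairs, best_gaussians):
    """Sum over pairs of the retained Gaussian evaluated at the TDoA of p.

    mic_pairs: list of (m1, m2) microphone positions.
    best_gaussians: list of (weight, mean, std), one per pair, in seconds.
    """
    total = 0.0
    for (m1, m2), (w, mu, sigma) in zip(mic_pairs, best_gaussians):
        z = (tdoa(p, m1, m2) - mu) / sigma
        total += w * np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))
    return total

m1 = np.array([-0.1, 0.0, 0.0])
m2 = np.array([0.1, 0.0, 0.0])
pairs = [(m1, m2)]
gaussians = [(1.0, 0.0, 1e-4)]          # peak at zero delay
on_axis = single_gaussian_srp(np.array([0.0, 1.0, 0.0]), pairs, gaussians)
off_axis = single_gaussian_srp(np.array([1.0, 1.0, 0.0]), pairs, gaussians)
```

A position equidistant from both microphones matches the zero-delay Gaussian exactly, whereas the off-axis position is heavily penalized.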

Experimental Setup
The tracking system has been evaluated in three modalities. The first one uses only audio information, the second one uses only video information, and the final one combines the two sources of information in an audiovisual modality.

Datasets
The databases used for system evaluation are AV16.3 [45] and CAV3D [17], both well known in the related state of the art. Both databases are fully labeled, providing the mouth ground-truth location. Synchronization information between the audio and video streams is also available.

Sequences Selection
The experiments were carried out using the same sequences in AV16.3 and CAV3D evaluated in [17,22], two state-of-the-art proposals used for performance comparison. Table 1 shows the AV16.3 selected sequences: seq08, seq11, and seq12 for SOT, and seq18, seq19, seq24, seq25, and seq30 for MOT. For each sequence, the three camera views were tested independently and combined with the first microphone array, giving a total of nine evaluation sequences in SOT and fifteen in MOT. Table 2 shows the CAV3D selected sequences, which are all the available ones for the SOT and MOT cases, on which we focused our experimental work.

Evaluation Metrics
The evaluation metrics used are the Track Loss Rate (TLR), and the Mean Absolute Error (MAE), as defined in [17]. The TLR is the percentage of frames with a track loss, where a target is considered to be lost if the error exceeds a given threshold. These metrics will be further specified below.
The MAE for 3D is expressed in m as in Equation (27), where N_F is the number of frames evaluated, and p_{n,i} and r_{n,i} are, respectively, the estimated and real positions of source i in frame n.
To evaluate the TLR in 3D, a target is considered to be lost if the error with respect to the ground truth is larger than 300 mm. We also use a fine error metric, the fine 3D MAE, in which only the frames where tracking is successful are considered in Equation (27).
For 2D, the MAE in the image plane is expressed in pixels as in Equation (28), which averages the distance ||p̃_{n,i} − r̃_{n,i}|| over sources and frames, where N_S and N_F are the number of sources and frames, respectively, in which the source position is inside the camera's FoV, and p̃_{n,i} and r̃_{n,i} are, respectively, the projections of p_{n,i} and r_{n,i} in the image plane.
Moreover, to compute the TLR in 2D, a threshold of 1/30 of the image diagonal in pixels is used. As in the 3D case, a fine 2D error metric is also defined for this 2D TLR threshold.
When comparing different proposals or experimental conditions, we also calculate the relative performance improvement Δ^{Alg2}_{Alg1} in all the evaluated metrics, where Alg1 and Alg2 refer to the algorithms or conditions being compared, and Metric_Alg refers to the considered metric calculated using the corresponding algorithm. Given that lower is better for all the proposed metrics, a positive result for Δ^{Alg2}_{Alg1} implies that Alg2 is better than Alg1.
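The three error measures can be computed together, as in the sketch below. It assumes that the MAE is the plain average of the Euclidean errors over all evaluated frames and sources and that the TLR is the percentage of those errors above the loss threshold; the function names are hypothetical and the example positions are synthetic.

```python
import numpy as np

def tracking_metrics(est, ref, loss_threshold):
    """Return (MAE, TLR in %, fine MAE) for estimated vs. reference positions.

    est, ref: arrays of shape (..., D) with D = 2 (pixels) or D = 3 (meters).
    """
    err = np.linalg.norm(est - ref, axis=-1)
    lost = err > loss_threshold
    mae = err.mean()
    tlr = 100.0 * lost.mean()
    # The fine error only averages the frames where tracking succeeded.
    fine = err[~lost].mean() if (~lost).any() else float("nan")
    return mae, tlr, fine

def tlr_threshold_2d(width, height):
    """2D loss threshold: 1/30 of the image diagonal, in pixels."""
    return float(np.hypot(width, height)) / 30.0

ref = np.array([[0.1, 0.0, 0.0], [0.2, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.0]])
est = np.zeros_like(ref)
mae, tlr, fine = tracking_metrics(est, ref, loss_threshold=0.3)  # 300 mm
```

Here one of the four frames exceeds 300 mm, so it counts towards the TLR and the MAE but is excluded from the fine error.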
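A form of the relative improvement that is consistent with this description (positive when Alg2 yields the lower metric) is the difference between both metrics normalized by the metric of Alg1. The sketch below assumes that form, expressed as a percentage; the function name is hypothetical.

```python
def relative_improvement(metric_alg1, metric_alg2):
    """Relative improvement (%) of Alg2 over Alg1 for lower-is-better metrics."""
    return 100.0 * (metric_alg1 - metric_alg2) / metric_alg1

# Example with the 2D errors quoted later for the video-only (5.78 pixels)
# and audiovisual (5.12 pixels) modalities.
delta = relative_improvement(5.78, 5.12)
worse = relative_improvement(10.0, 12.0)
```

A negative value, as in the second call, indicates that Alg2 performs worse than Alg1.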

System Configuration
In AV16.3, the audio signals were upsampled to 96 kHz. Also in this database, each image frame was upscaled by a factor of 2 so that faces far away from the camera fit the VJ OpenCV [62] templates (20 × 20 pixels). A lens distortion correction was also applied.
For both AV16.3 and CAV3D data, the audio signal pre-processing starts with a pre-emphasis filter (H(z) = 1 − 0.98 z^{−1}) to enhance high-frequency content. After filtering, flat-top windows of 8192 samples (≈ 85.3 ms) are applied to the signal, with a window shift equal to the video frame period (25 f.p.s. in AV16.3 and 15 f.p.s. in CAV3D). Thus, there is one audio segment associated with each video frame. Moreover, the Fourier transform size has been selected equal to the signal window size (8192).
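The pre-processing chain can be sketched as follows. This is a minimal illustration: the pre-emphasis coefficient, window length, and frame rates come from the text, while the function names, the alignment of each window with the start of its video frame, and the rectangular default window are assumptions (a flat-top window array would be passed through the `window` argument).

```python
import numpy as np

def preemphasis(x, coeff=0.98):
    """Apply H(z) = 1 - coeff * z^-1 to the signal."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - coeff * x[:-1]
    return y

def frame_signal(x, fs, fps, win_len=8192, window=None):
    """Cut one windowed audio segment per video frame (hop = fs / fps samples)."""
    hop = int(round(fs / fps))
    if len(x) < win_len:
        return np.empty((0, win_len))
    if window is None:
        window = np.ones(win_len)   # placeholder; the paper uses flat-top windows
    n_frames = 1 + (len(x) - win_len) // hop
    return np.stack([x[i * hop:i * hop + win_len] * window
                     for i in range(n_frames)])

filtered = preemphasis(np.array([1.0, 1.0, 1.0]))
frames = frame_signal(np.zeros(96000), fs=96000, fps=25)   # one second of audio
```

At 96 kHz and 25 f.p.s. the hop is 3840 samples, so consecutive 8192-sample windows overlap by more than half.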
Regarding the algorithm parameters needed in the proposal as described in Sections 5 and 6, all of them were tuned on a small subset composed of additional sequences from the AV16.3 dataset, except for the Fg./Bg. Segmentation, in which the tuning was carried out using the extra sequence seq21 from CAV3D.
These are the final values in the experimental setup: The face size in 3D was set to a fixed value of h^{3D}_r = 17 cm. The appearance-based likelihood threshold θ_VJ was set to 0.5. The face size change factor for slice exploration was set to γ = 1.2. The minimum dispersion in the S dimension was fixed to σ^min_ss = 10. The color spatiogram likelihood threshold was set to θ_col = 0.6. The Fg./Bg. Segmentation grayscale intensity threshold was set to θ_fg = 80 for AV16.3 and θ_fg = 30 for CAV3D. The acoustic power threshold θ_a was set to 0.8. The model parameters v̄ = 1 m s^{−1} and β = 10 s^{−1} used in [60,63] have shown good results, and thus they are the ones we applied in this work. The PF algorithm used was the SIR with N_P = 1000 particles per target.

Results and Discussion
In this section, different results, both quantitative and qualitative, are included to demonstrate the contributions of the GAVT global proposal and its processes in the audiovisual MOT objective.
As mentioned in Section 2, the AV16.3 and CAV3D datasets are used for the experimental work, due to their application in the most important state-of-the-art related works [17,20].
Therefore, two different sections are included here to analyze these results: a first one in which the multimodality of the proposal is evaluated and discussed, and a second one that compares our results with the best proposals in the state of the art on the selected datasets [20].
In the tables within this section, we provide values for the TLR and for the MAE and fine MAE in 2D and 3D, segregated for the SOT and MOT partitions, together with their average value. In all cases, we explicitly include the average of the standard deviation of the metric (±σ), given that we carried out ten runs per sequence due to the probabilistic nature of the GAVT proposal. We also include information on the modality being used, either audio only (A for short), video only (V for short), or audiovisual (AV for short). Finally, to quickly identify the experimental conditions of each table with results, in the first row we state the dataset used (AV16.3 or CAV3D), and whether the metrics are in the image plane (2D) or in the three-dimensional space (3D), also including the modality being used. The comparison between modalities and algorithms is done using the Δ^{Alg2}_{Alg1} metric defined in the previous section. In all cases, the best results across metrics are highlighted with a green background in the corresponding cell.

Audiovisual Combination Improvements
We first present the contribution of audiovisual tracking versus the individual audio-only and video-only modalities on the AV16.3 database sequences. We did not make this comparison on the CAV3D dataset because it contains several sequences in which the speakers leave the camera's FoV for a certain time, so the video-only modality could not be compared on equal terms with the other two modalities.
The results of our GAVT proposal using the audio-only (A), video-only (V), and audiovisual (AV) modalities are shown in Tables 3 and 4 for the 2D and 3D metrics, respectively, together with the relative performance improvements from A to AV (Δ^{AV}_{A}) and from V to AV (Δ^{AV}_{V}).
The obtained results clearly show that, for both the SOT and MOT tasks, the audiovisual combination outperforms its monomodal counterparts. As expected, the visual modality is far better than the audio modality, but in all cases the audiovisual combination contributes to improved results.
In the 2D case, the average MAE is strongly improved when combining the audio and video modalities, with similar improvements for the SOT and MOT tasks (86.2% and 87%, respectively), with an average improvement of 86.7%. The improvements of the audiovisual modality as compared with the video-only one are still relevant, being 37.6% on average. When considering the fine MAE, the audiovisual modality does not improve the visual-only modality, with minor degradation of −2.0% on average, which is not significant (especially considering the non-linear characteristic of this metric), with errors below 3.35 pixels in all cases.
In the 3D case, the improvements of the audiovisual modality are consistent across all metrics, with an MAE of 18 cm on average, and very relevant improvements of up to 50.3% as compared with the visual-only modality.
Also as expected, the results for the SOT task are better than those for the MOT task. For example, in the audiovisual modality, the 2D MAE is 5.12 pixels for SOT vs. 7.4 pixels for MOT. In the 3D case, the 3D MAE is 15 cm for SOT vs. 21 cm for MOT.
For the SOT task, the highest relative improvement in the comparison between the audiovisual and audio-only modalities was found for sequence seq11 camera 2, with 79%, and the lowest relative improvement was found for sequence seq08 camera 1, with 25%. We further discuss these two extreme cases below.
The top part of Figure 12 shows the mean (dark line) and standard deviation (light color area) of the 3D error over time for the seq11 sequence camera 2. The bottom part of Figure 12 shows the top view of the tracked trajectories for each of the modalities.
Analyzing the errors in 2D (Table 3), where the tracked positions are reprojected to the camera image plane, we can see that the video-only modality is very accurate, with an error of 5.78 pixels, and close to the audiovisual solution with 5.12 pixels, while the audio-only modality presents a much higher error of 37.9 pixels.
From comparing 2D and 3D errors, we can observe that the video modality errors come principally from the estimation of depth (see the bottom part of Figure 12), so that we can interpret that the proposed multi-pose face model can accurately locate the mouth in the image plane domain.
From the results of sequence seq11 camera 2 in Figure 12, it can be observed that the audio modality tracker presents significant errors for most of the sequence. This error can be observed in the trajectories (see bottom left graphic in Figure 12), where the system overestimates the distance from the target to the microphone array, especially when the target is far from it. The video-only tracker presents a better estimation in the first part of the sequence but underestimates the distance at the end of the sequence (see bottom middle graphic in Figure 12), when the target is farthest from the camera. In the audiovisual trajectories (see bottom right graphic in Figure 12), both modalities compensate for each other's errors, yielding an intermediate estimation that leads to improved results.
Figure 13 shows the detailed results for sequence seq08 camera 1, where it can be observed that the audio-only modality presents low errors during the whole sequence (bottom left graphic), while the video-only modality fails on the estimation of depth, with large errors (bottom middle graphic). Although the video-only modality presents larger errors than the audio-only one, the audiovisual combination again improves on the monomodal tracking options.
The audiovisual integration usually works successfully even when the video performance is lower than the audio performance. As an example, the upper graphic of Figure 14 shows the 3D MAE variation over time for sequence seq08 camera 3, where the video modality shows higher errors than the audio-only modality, especially in the first half of the sequence. The audiovisual fusion is again able to obtain results that improve on the monomodal versions. The lower graphic of Figure 14 shows another case in which the audio modality presents higher errors than the video modality in the central part of the sequence, and the audiovisual fusion compensates for such behavior.
For the MOT task, the audio modality significantly increases its errors, up to 68 cm on average, while the video modality obtained a result of 39 cm. This performance decrease in the audio tracking modality could be explained by target identities being swapped between trackers.
Regarding the detailed analysis for the MOT task, Figure 15 shows an example for sequence seq24 camera 3. The top part of Figure 15 shows the mean (dark line) and standard deviation (light color area) of the 3D error over time for the seq24 sequence camera 3. The bottom part of Figure 15 shows the top view of the tracked trajectories for each of the modalities. In this sequence, there were two speakers, so two sets of graphics are shown.
In this case, the audio-only modality (bottom left graphics of Figure 15) only shows good results in the initial part of the sequence. The video-only modality (bottom middle graphics of Figure 15) exhibits good results for speaker 2 but worse results for speaker 1. The combination of both modalities (bottom right graphics of Figure 15) compensates for these errors, achieving good results even for speaker 1.
In the MOT task, the average 3D errors are 40% higher compared to the SOT case. We observed that most large errors were associated with instances where one target was lost because of a missing face and the difficulty to recover it, similar to the difficulties encountered in the SOT task. Nevertheless, our occlusion detection and handling mechanisms proved effective in preventing particle interchanges, leading to good results. As an example, in sequence seq22 (see Figure 16), we can see the successful tracking of two targets walking in circular patterns in front of the camera.
We can also find examples of audiovisual integration not being able to successfully compensate for the impact of a modality performing badly. Figure 17 illustrates such a scenario, where the system correctly integrates the audio and video modalities for speaker 1 (top graphic), resulting in improved tracking accuracy. However, for speaker 2 (bottom graphic), the integration improves the results as compared with the video-only modality, but grows worse as compared with the audio-only modality toward the end of the sequence.

Comparison with the State-of-the-Art
In this section, we compare our GAVT proposal with that of [17] (which we refer to as AV3T), which represents state-of-the-art performance on the AV16.3 and CAV3D datasets within our experimental design. Tables 5 and 6 show the 2D and 3D performance metrics of the GAVT and AV3T systems on the AV16.3 dataset, for the audio-only (A) and video-only (V) modalities. The tables also include the relative performance improvements of GAVT as compared with AV3T (Δ^{GAVT}_{AV3T}). From these results, it is clear that our audio-only method performs worse than that proposed by Qian et al. [17], since we did not use height information from the video (relative improvement of −51% for the average 2D MAE and −38% for the average 3D MAE).
It is also clear that our video-only tracker outperforms AV3T in the MAE metrics (23% for both the average 2D and 3D MAEs). This result shows that our proposed pose-dependent visual appearance model localizes the mouth better than the generic face detection with a mouth position estimation based on the aspect ratio proposed in [17].
Tables 7 and 8 show, again for the AV16.3 dataset, the 2D and 3D performance metrics for the audiovisual modality (AV) of the GAVT and AV3T systems, as well as the improvements of our proposal against AV3T (Δ^{GAVT}_{AV3T}). Our audiovisual method GAVT presents significant improvements as compared with AV3T, for both the 2D and 3D metrics, and for both the SOT and MOT tasks. The largest error reductions were found in the 2D MAE metrics (up to almost 70% relative improvement in 2D MAE), while the average relative improvement in 3D MAE is 4.2%, at the expense of a performance decrease of −9.4% for the 3D TLR.
The results on the CAV3D dataset for the full system with audio and video integration are presented in Tables 9 and 10. The performance improvements we achieve are clear for all the 3D metrics and for both the SOT and MOT tasks, reaching an average of 11.5% in the 3D MAE. However, for the 2D metrics, we are only able to improve the average MAE metrics, up to 2.6% in 2D MAE, with improvements in MOT but not in SOT. It is worth mentioning here that the most relevant metric for our purposes is the 3D MAE, as we are interested in precise 3D localization in the given environment. It should also be noted that our method does not use a face detector. In a context where speakers do not look at the camera for a while, leave the FoV, or have no visible face, particles may move away from the actual target (losing the target), and it is more difficult to recover the location of the speaker. In other words, without a face detector that scans the entire image in each frame, it is more difficult to recover from a target loss. Despite that, our proposal successfully solves the audiovisual 3D localization task, improving on the AV3T system in 3D MAE performance for both the AV16.3 and CAV3D datasets.

Limitations of the GAVT Proposal
We consider that the two databases, AV16.3 and CAV3D, cover a wide range of situations that can be encountered in an intelligent space (different environment sizes, sensor configurations, camera resolutions, speakers moving in and out of the camera's FoV, etc.). However, there are characteristics of GAVT that may limit its performance in additional situations that may arise in a real-world context.
We can first refer to the presence of different head sizes, which will impact the estimation of the distance from cameras to the speaker: For example, smaller heads (such as children's) will lead to an overestimation of this distance (given our assumption of an average adult head size).
We can also consider the effect of larger environments, which may first increase the number of possible speakers, thus increasing the difficulty of the task, and second increase the occurrence of situations in which a speaker is not visible in the camera's FoV, mainly if they remain silent for long periods of time. In these two situations, the algorithm will have no information to track for a long time, and the particles may drift away from the speaker's last known position. In these cases, a detection mechanism would be necessary, based on audio and/or video information, to reinsert particles once the speaker starts talking or appears in the camera's FoV. Larger environments and a higher number of speakers will also require an increase in the number of particles, which, in turn, may lead to a relevant increase in computational complexity.

Conclusions and Future Work
In this paper, we have proposed a robust and precise system to track a known number of multiple speakers in a 3D smart space, combining audio and video information. The system uses particle filters with an ad hoc designed audiovisual probabilistic observation model. The visual likelihood model is based on VJ detectors with a pose-dependent strategy that improves the mouth location estimation in 2D and 3D. Additionally, we adopt a specific mechanism to handle MOT tasks and avoid target interference by exploiting the likelihood dispersion effect. The audio likelihood model uses a probabilistic version of SRP, adopting a refined peak selection strategy to avoid target interference, based on the joint distribution of all pairs of microphones. The final fusion model assumes statistical independence of both modalities, so that the audiovisual probability results from the product of audio and video probability density functions.
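The fusion rule described above (the audiovisual probability as the product of the audio and video probability density functions) can be sketched as a particle weight update. This is a minimal one-dimensional illustration: the Gaussian likelihoods, the particle positions, and the function names are hypothetical stand-ins, not the GAVT observation models.

```python
import math


def gaussian_pdf(x, mean, sigma):
    """Density of a 1D Gaussian, used here as a stand-in likelihood."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def fused_weights(particles, audio_lik, video_lik):
    """Normalized weights proportional to audio_lik(x) * video_lik(x)."""
    raw = [audio_lik(x) * video_lik(x) for x in particles]
    total = sum(raw)
    if total == 0.0:
        # Degenerate case: fall back to uniform weights.
        return [1.0 / len(particles)] * len(particles)
    return [w / total for w in raw]


# Hypothetical particle positions and likelihoods for illustration only.
particles = [0.0, 0.5, 1.0, 1.5, 2.0]


def audio(x):
    return gaussian_pdf(x, mean=1.2, sigma=0.5)   # broad audio likelihood


def video(x):
    return gaussian_pdf(x, mean=0.9, sigma=0.2)   # sharper video likelihood


weights = fused_weights(particles, audio, video)
best = particles[max(range(len(particles)), key=weights.__getitem__)]
```

Because the weights are a product, a particle needs support from both modalities to score highly, and the sharper likelihood dominates where the two disagree.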
In the AV16.3 dataset, our audiovisual system proposal shows average relative improvements in 2D mouth localization of 86.7% and 37.6% over the audio and visual counterparts, respectively. In 3D localization, the improvements were 66.1% and 50.3%. This demonstrates that the proposed audiovisual likelihood combination significantly improves on the tracking results of its monomodal counterparts. When compared to the state of the art, in 2D and 3D metrics, the proposed system presents improved results in the visual and audiovisual modalities for both the SOT and MOT tasks. In the AV16.3 dataset, the 2D average relative improvement is 23% for the visual modality and 69.7% for the audiovisual case, while the 3D improvements are 23% for the visual modality and 4.2% for the audiovisual one. For the audiovisual modality and the CAV3D dataset, the 2D average improvement is 2.6%, while the 3D improvement is 11.5%, rising to 18.1% in the more difficult MOT task. The better results in 2D show that the proposed pose-dependent face model gives a better adapted likelihood of finding the mouth inside the face, in the image plane, as compared with state-of-the-art proposals. The 3D results are consistently better.
The most important errors in the experiments described here derive from poor depth estimates and from the difficulty of recovering a target with the visual likelihood model once it has been lost. As a global conclusion, the proposed audiovisual system has been demonstrated to successfully handle occlusions in MOT tasks, and to significantly improve state-of-the-art results in a challenging audiovisual tracking context such as CAV3D.
In future work, we plan to focus on new alternatives to decrease the depth estimation errors, which are the main error source. For this purpose, we plan to improve the head pose information. Rather than a discrete set of possible poses, we will evaluate a continuous estimation of the three head pose angles using recent deep learning-based head pose estimators. The challenge in this case will be the use of head pose estimation algorithms on low-resolution faces, such as those present in the AV16.3 sequences. We will also combine the proposed likelihood model with a face detector, to solve target losses. To deal with long target losses, we will include a birth and death hypotheses mechanism that will also help to track a variable and unknown number of speakers.

Institutional Review Board Statement:
Ethical review and approval were waived for this study, as we have used two datasets that were made available by their authors to the scientific community.

Informed Consent Statement:
We did not require users' consent, as we have used two datasets that were made available by their authors to the scientific community, so we were not involved in the data acquisition processes.

Data Availability Statement:
Publicly available datasets were used in this study. The AV16.3 data can be found at https://www.idiap.ch/en/dataset/av16-3 and the CAV3D data can be requested at https://speechtek.fbk.eu/cav3d-dataset.